Report the dimension of election.raw.
dim(election.raw)
## [1] 31167 5
election.raw
## # A tibble: 31,167 x 5
## state county candidate party votes
## <chr> <chr> <fct> <fct> <dbl>
## 1 Delaware Kent Joe Biden DEM 44518
## 2 Delaware Kent Donald Trump REP 40976
## 3 Delaware Kent Jo Jorgensen LIB 1044
## 4 Delaware Kent Howie Hawkins GRN 420
## 5 Delaware Kent Write-ins WRI 0
## 6 Delaware New Castle Joe Biden DEM 194238
## 7 Delaware New Castle Donald Trump REP 87685
## 8 Delaware New Castle Jo Jorgensen LIB 2932
## 9 Delaware New Castle Howie Hawkins GRN 1278
## 10 Delaware New Castle Write-ins WRI 0
## # … with 31,157 more rows
The election.raw dataset has 31,167 rows and 5 columns.
Are there missing values in the data set?
sum(is.na(election.raw))
## [1] 0
No, there are no missing values in the data set.
Compute the total number of distinct values in state in election.raw to verify that the data contains all states and a federal district.
unique(election.raw['state'])
## # A tibble: 51 x 1
## state
## <chr>
## 1 Delaware
## 2 District of Columbia
## 3 Florida
## 4 Georgia
## 5 Hawaii
## 6 Idaho
## 7 Illinois
## 8 Indiana
## 9 Iowa
## 10 Kansas
## # … with 41 more rows
nrow(unique(election.raw['state']))
## [1] 51
There are 51 unique values in the 'state' column: the 50 U.S. states plus the District of Columbia.
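Rather than taking `nrow()` of the unique rows, the count can be obtained in one step with dplyr (a sketch, assuming dplyr is loaded as in the rest of this report):

```r
# n_distinct() counts the unique values of a vector directly
n_distinct(election.raw$state)
## [1] 51
```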
Report the dimension of census.
dim(census)
## [1] 3220 37
The census dataset has 3220 rows and 37 columns.
Are there missing values in the data set?
sum(is.na(census))
## [1] 1
Yes, there is one missing value in the data set.
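Since there is exactly one NA, it can be located directly; a sketch, assuming `census` is loaded as above:

```r
# which() with arr.ind = TRUE reports the row/column position of each NA cell
na.pos <- which(is.na(census), arr.ind = TRUE)
na.pos
# Name of the affected column
colnames(census)[na.pos[, "col"]]
```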
Compare the total number of distinct values in county in census with that in election.raw. Comment on your findings.
##unique county data
nrow(unique(census['County']))
## [1] 1955
unique(census['County'])
## # A tibble: 1,955 x 1
## County
## <chr>
## 1 Autauga County
## 2 Baldwin County
## 3 Barbour County
## 4 Bibb County
## 5 Blount County
## 6 Bullock County
## 7 Butler County
## 8 Calhoun County
## 9 Chambers County
## 10 Cherokee County
## # … with 1,945 more rows
##unique election.raw data
nrow(unique(election.raw['county']))
## [1] 2825
unique(election.raw['county'])
## # A tibble: 2,825 x 1
## county
## <chr>
## 1 Kent
## 2 New Castle
## 3 Sussex
## 4 District of Columbia
## 5 Ward 2
## 6 Ward 3
## 7 Ward 4
## 8 Ward 5
## 9 Ward 6
## 10 Ward 7
## # … with 2,815 more rows
There are 1955 unique county names in the census data set (out of 3220 rows, since many county names repeat across states), while there are 2825 unique values of county in election.raw. The two counts are not directly comparable: census appends suffixes such as "County" or "Parish" to each name, and election.raw reports results for sub-county units in some places (e.g., the District of Columbia's wards and New England townships), which inflates its number of distinct values.
Construct aggregated data sets from election.raw data.
election.state= election.raw %>% group_by(state, candidate) %>% summarise_each(funs(mean), votes)
## Warning: `summarise_each_()` is deprecated as of dplyr 0.7.0.
## Please use `across()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
election.state
## # A tibble: 351 x 3
## # Groups: state [51]
## state candidate votes
## <chr> <fct> <dbl>
## 1 Alabama Donald Trump 21405.
## 2 Alabama Jo Jorgensen 373.
## 3 Alabama Joe Biden 12589.
## 4 Alabama Write-ins 109.
## 5 Alaska Brock Pierce 7.42
## 6 Alaska Don Blankenship 8.7
## 7 Alaska Donald Trump 2025.
## 8 Alaska Jesse Ventura 22.8
## 9 Alaska Jo Jorgensen 87.0
## 10 Alaska Joe Biden 1144.
## # … with 341 more rows
election.total= election.raw %>% summarise_each(funs(sum), votes)
election.total
## # A tibble: 1 x 1
## votes
## <dbl>
## 1 150361237
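The deprecation warnings above come from `summarise_each()` and `funs()`, which have since been removed from the dplyr API. A sketch of the same aggregations in current dplyr (>= 1.0) syntax:

```r
# Current-dplyr equivalents of the deprecated summarise_each()/funs() calls
election.state <- election.raw %>%
  group_by(state, candidate) %>%
  summarise(votes = mean(votes), .groups = "drop_last")

election.total <- election.raw %>%
  summarise(votes = sum(votes))
```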
How many named presidential candidates were there in the 2020 election?
unique(election.raw['candidate'])
## # A tibble: 38 x 1
## candidate
## <fct>
## 1 Joe Biden
## 2 Donald Trump
## 3 Jo Jorgensen
## 4 Howie Hawkins
## 5 Write-ins
## 6 Gloria La Riva
## 7 Brock Pierce
## 8 Rocky De La Fuente
## 9 Don Blankenship
## 10 Kanye West
## # … with 28 more rows
Of the 38 distinct values in the candidate column, 36 correspond to named presidential candidates; entries such as Write-ins are not named candidates.
Draw a bar chart of all votes received by each candidate.
candid= election.raw %>% group_by(candidate) %>% summarise_each(funs(sum), votes)
cand1=unlist(candid[1],use.names=FALSE)
cand_num1= unlist(candid[2],use.names=FALSE)
#plot showing how many people voted for Joe Biden and Donald Trump, relative to other candidates
barplot(cand_num1, main='Total Votes per Candidate', xlab= 'candidates', ylab= 'total votes ', col='#69b3a2', names.arg=cand1)
This first chart shows the total sum of votes per candidate during the 2020 presidential election. Looking at this chart, it is apparent how popular Joe Biden and Donald Trump were compared to the other candidates.
#bar charts comparing all candidates
par(mfrow=c(1,3))
barplot(log(cand_num1[c(1:4)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE, names.arg=cand1[c(1:4)], col='#69b3a2', xlim = c(0, 20))
barplot(log(cand_num1[c(5:7)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE, names.arg=cand1[c(5:7)], col='#69b3a2',xlim = c(0, 20))
barplot(log(cand_num1[c(8:11)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE, names.arg=cand1[c(8:11)], col='#69b3a2',xlim = c(0, 20))
par(mfrow=c(1,3))
barplot(log(cand_num1[c(12:15)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE, names.arg=cand1[c(12:15)], col='#69b3a2',xlim = c(0, 20))
barplot(log(cand_num1[c(16:19)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(16:19)], col='#69b3a2',xlim = c(0, 20))
barplot(log(cand_num1[c(20:22)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(20:22)], col='#69b3a2',xlim = c(0, 20))
par(mfrow=c(1,3))
barplot(log(cand_num1[c(24:27)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(24:27)], col='#69b3a2',xlim = c(0, 20))
barplot(log(cand_num1[c(28:30)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(28:30)], col='#69b3a2',xlim = c(0, 20))
barplot(log(cand_num1[c(31:33)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(31:33)], col='#69b3a2',xlim = c(0, 20))
par(mfrow=c(1,2))
barplot(log(cand_num1[c(34,36)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(34,36)], col='#69b3a2',xlim = c(0, 20))
barplot(log(cand_num1[c(35, 37)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(35, 37)], col='#69b3a2',xlim = c(0, 20))
barplot(log(cand_num1[c(23,38)]), xlab= 'candidates', ylab= 'total votes (log)', horiz=TRUE,names.arg=cand1[c(23, 38)], col='#69b3a2',xlim = c(0, 20))
These charts compare the total number of votes each candidate received. To make the bars easier to read, all charts are drawn on a log scale.
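The many small panels above can also be collapsed into a single chart by putting the vote axis itself on a log scale; a sketch, assuming dplyr and ggplot2 are loaded and using `candid` from the chunk above:

```r
# One horizontal bar chart on a log scale, instead of many small panels
candid %>%
  filter(votes > 0) %>%   # a log scale cannot display zero-vote entries
  ggplot(aes(x = reorder(candidate, votes), y = votes)) +
  geom_col(fill = "#69b3a2") +
  scale_y_log10() +
  coord_flip() +
  labs(title = "Total Votes per Candidate (log scale)",
       x = "candidate", y = "total votes")
```

Ordering the candidates by vote total with `reorder()` also makes the ranking visible at a glance.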
Create data sets county.winner and state.winner by taking the candidates with the highest proportion of votes in both county level and state level.
subs= election.raw%>% group_by(state,county) %>% mutate(total = sum(votes),pct= votes/sum(votes))
county.winner= subs %>% arrange(county, desc(pct))
county.winner= top_n(county.winner, 1)
## Selecting by pct
county.winner
## # A tibble: 4,481 x 7
## # Groups: state, county [4,478]
## state county candidate party votes total pct
## <chr> <chr> <fct> <fct> <dbl> <dbl> <dbl>
## 1 South Carolina Abbeville Donald Trump REP 9343 14552 0.642
## 2 Maine Abbot Donald Trump REP 288 417 0.691
## 3 Massachusetts Abington Joe Biden DEM 5127 9550 0.537
## 4 Louisiana Acadia Parish Donald Trump REP 22596 28425 0.795
## 5 Virginia Accomack Donald Trump REP 9172 16938 0.542
## 6 Massachusetts Acton Joe Biden DEM 10793 13533 0.798
## 7 Maine Acton Donald Trump REP 930 1636 0.568
## 8 Massachusetts Acushnet Donald Trump REP 3193 5841 0.547
## 9 New Hampshire Acworth Joe Biden DEM 293 569 0.515
## 10 Idaho Ada Donald Trump REP 130699 259389 0.504
## # … with 4,471 more rows
state.winner= election.raw %>% group_by(state, candidate) %>% summarise_each(funs(sum), votes) %>% mutate(pct= votes/sum(votes))
state.winner= top_n(state.winner,1)
## Selecting by pct
state.winner
## # A tibble: 51 x 4
## # Groups: state [51]
## state candidate votes pct
## <chr> <fct> <dbl> <dbl>
## 1 Alabama Donald Trump 1434159 0.621
## 2 Alaska Donald Trump 80999 0.614
## 3 Arizona Joe Biden 1643664 0.495
## 4 Arkansas Donald Trump 761251 0.626
## 5 California Joe Biden 9315259 0.646
## 6 Colorado Joe Biden 1753416 0.553
## 7 Connecticut Joe Biden 1059252 0.593
## 8 Delaware Joe Biden 295413 0.588
## 9 District of Columbia Joe Biden 258561 0.926
## 10 Florida Donald Trump 5667834 0.512
## # … with 41 more rows
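`top_n()` is superseded in current dplyr; the same winners can be computed with `slice_max()`, which names the ranking column explicitly (a sketch; ties are kept by default, matching `top_n()`, and hypothetical names `county.winner2`/`state.winner2` are used to avoid clobbering the objects above):

```r
# slice_max() supersedes top_n() for "highest value per group"
county.winner2 <- subs %>%
  arrange(county, desc(pct)) %>%
  slice_max(pct, n = 1)

state.winner2 <- election.raw %>%
  group_by(state, candidate) %>%
  summarise(votes = sum(votes), .groups = "drop_last") %>%
  mutate(pct = votes / sum(votes)) %>%
  slice_max(pct, n = 1)
```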
Draw county-level map by creating counties= map_data("county"). Color by county.
county <- map_data("county")
ggplot(data = county) +
geom_polygon(aes(x = long, y = lat, fill = subregion, group = group),
color = "white") +
coord_fixed(1.3) +
guides(fill=FALSE)
Now color the map by the winning candidate for each state.
# rename columns and lower-case the state names so they match the map data for the join
colnames(state.winner)= c("region", "candidate", "votes", "pct")
state.winner$region= tolower(state.winner$region)
states <- map_data("state")
#joining dataframes
state_join= left_join(states, state.winner, by= 'region')
#map code
ggplot(data = state_join) + ggtitle("Presidential Election Results in the United States") +
geom_polygon(aes(x = long, y = lat, fill = candidate, group = group),
color = "white") +
coord_fixed(1.3) + theme(legend.title = element_blank())
Color the map of the state of California by the winning candidate for each county. Note that some counties have not finished counting the votes, and thus do not have a winner. Leave these counties uncolored.
cali= county %>% filter(region== 'california')
cali.winner=county.winner %>% filter(state=="California")
cali.winner$subregion= tolower(cali.winner$county)
cali_join= left_join(cali, cali.winner, by='subregion')
#map code
ggplot(data = cali_join) + ggtitle("Presidential Election Results in California") +
geom_polygon(aes(x = long, y = lat, fill = candidate, group = group),
color = "white") +
coord_fixed(1.3)
Create a visualization of your choice by using census data.
##california census (poverty rates in different california counties)
cali_cen= census %>% select(State, County, TotalPop, Income, Poverty, ChildPoverty) %>% group_by(State, County) %>% filter(State=='California')
cali_cen= cali_cen[-c(1)]
#organizing dataframe 'cen_demo'
colnames(cali_cen)= c("subregion", "TotalPop", "Income", "Poverty", "ChildPoverty")
cali_cen$subregion= tolower(cali_cen$subregion)
cali_cen$subregion= gsub("\\s*\\w*$", "", cali_cen$subregion)
#joining dataframes
cen_join= left_join(cali, cali_cen, by='subregion')
#mapping
#mapping (note: par(mfrow) has no effect on ggplot objects, so the two maps print separately)
ggplot(data = cali_join) + ggtitle("Presidential Election Results in California") +
geom_polygon(aes(x = long, y = lat, fill = candidate, group = group),
color = "white") +
coord_fixed(1.3) + theme(legend.title = element_blank())
ggplot(data = cen_join) + ggtitle("Poverty in California") +
geom_polygon(aes(x = long, y = lat, fill = Poverty, group = group),
color = "white") +
coord_fixed(1.3)
Using the census data set, I examined whether county-level poverty in California correlates with a county voting predominantly Republican or Democratic. Broadly, counties with lower poverty rates appear more likely to vote Democratic, while counties with higher poverty rates appear more Republican.
census.clean = na.omit(census) %>%
mutate(Men =100*(Men/TotalPop), Employed = 100*(Employed/TotalPop), VotingAgeCitizen= 100*(VotingAgeCitizen/TotalPop))
census.clean$Minority = census.clean$Hispanic+census.clean$Black+census.clean$Asian+census.clean$Pacific+census.clean$Native
drops = c("IncomeErr", "IncomePerCap", "IncomePerCapErr", "Walk", "PublicWork", "Construction","Hispanic","Black","Asian","Native","Pacific", "Women","ChildPoverty","Mino")
census.clean = census.clean[ , !(names(census.clean) %in% drops)]
head(census.clean,n=5)
## # A tibble: 5 x 25
## CountyId State County TotalPop Men White VotingAgeCitizen Income Poverty
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1001 Alab… Autau… 55036 48.9 75.4 74.5 55317 13.7
## 2 1003 Alab… Baldw… 203360 48.9 83.1 76.4 52562 11.8
## 3 1005 Alab… Barbo… 26201 53.3 45.7 77.4 33368 27.2
## 4 1007 Alab… Bibb … 22580 54.3 74.6 78.2 43404 15.2
## 5 1009 Alab… Bloun… 57667 49.4 87.4 73.7 47412 15.6
## # … with 16 more variables: Professional <dbl>, Service <dbl>, Office <dbl>,
## # Production <dbl>, Drive <dbl>, Carpool <dbl>, Transit <dbl>,
## # OtherTransp <dbl>, WorkAtHome <dbl>, MeanCommute <dbl>, Employed <dbl>,
## # PrivateWork <dbl>, SelfEmployed <dbl>, FamilyWork <dbl>,
## # Unemployment <dbl>, Minority <dbl>
I also decided to drop the variable Women, since it can easily be recovered from Men and TotalPop and thus adds nothing for prediction. For the remaining features, I examined the correlation between Poverty and ChildPoverty, whose observations follow very similar patterns; with a correlation of .9328, I decided to drop ChildPoverty. Intuitively this makes sense: most children depend on their parents' wealth or lack thereof, so the two variables are highly correlated and almost perfectly collinear outside of some special cases. I also found Minority and White to be almost perfectly negatively correlated, at -.9973, so I decided to drop the Minority variable as well. Finally, I looked at the correlation coefficient between Poverty and Income, expecting a strong negative relationship; at -.7646, I did not feel the case was strong enough to drop the variable.

## Problem 11
pr.out = prcomp(census.clean[,-c(1:3)], scale = TRUE, center = TRUE)
pc.county = data.frame(pr.out$rotation[,1:2])
arrange(pc.county,PC1)[,1:2]
## PC1 PC2
## Poverty -0.39519558 -0.11026775
## Unemployment -0.36333539 -0.08392321
## Minority -0.32163531 -0.20162620
## Service -0.22030132 -0.14424445
## Drive -0.12406425 0.41959144
## MeanCommute -0.09360250 0.16346541
## Production -0.09286140 0.29211482
## Office -0.08353357 0.15856107
## Carpool -0.06970064 -0.06612375
## OtherTransp -0.02425523 -0.22867662
## Men 0.02016044 -0.13401583
## TotalPop 0.02665276 -0.02773789
## VotingAgeCitizen 0.02848391 0.07441859
## Transit 0.04320677 -0.12909339
## PrivateWork 0.05534238 0.43346387
## FamilyWork 0.08671102 -0.20742929
## SelfEmployed 0.15606080 -0.30725057
## WorkAtHome 0.24107653 -0.32313951
## Professional 0.25540023 -0.17744244
## White 0.31803621 0.20680452
## Income 0.33349960 0.03586265
## Employed 0.37440250 0.05665303
Here, I dropped CountyId in addition to State and County: although CountyId is numeric, its values are predetermined labels for each county rather than measurements. We chose to center the features because this is required before taking the PCA, and to scale them because they are not all on the same scale. The features with the highest absolute PC1 loading values are Poverty, Employed, and Unemployment, from highest to lowest.
Looking at the PC1 loading values, the features Poverty, Unemployment, Minority, Service, Drive, MeanCommute, Production, Office, Carpool, and OtherTransp have negative loadings, while Men, TotalPop, VotingAgeCitizen, Transit, PrivateWork, FamilyWork, SelfEmployed, WorkAtHome, Professional, White, Income, and Employed have positive loadings. Features with opposite signs move in opposite directions along PC1, which implies a negative correlation between them; this makes sense for pairs such as Employed and Unemployment, or Income and Poverty.
pr.var = pr.out$sdev^2
plot(pr.var, xlab = "Principal Component",ylab= "Variance of Principal Component", ylim = c(0,200))
pve = pr.var/sum(pr.var)
plot(cumsum(pve), xlab="Principal Component ",
ylab=" Cumulative Proportion of Variance Explained ", ylim=c(0,1), type='b')
abline( h= .9, col = "red")
The number of principal components needed to capture 90% of the variance is 12.
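The count above can be read off the cumulative PVE plot, or computed directly (a sketch using `pve` from the chunk above):

```r
# Smallest number of components whose cumulative PVE reaches 90%
min(which(cumsum(pve) >= 0.9))
## [1] 12
```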
set.seed(1)
census.dist = dist(census.clean[,-c(1:3)])
census.hclust = hclust(census.dist)
clus = cutree(census.hclust,10)
## dendrogram: branches colored by 10 groups
dend1 = as.dendrogram(census.hclust)
# color branches and labels by 3 clusters
dend1 = color_branches(dend1, k=10)
dend1 = color_labels(dend1, k=10)
# change label size
dend1 = set(dend1, "labels_cex", .5)
dend1 = set_labels(dend1, labels=census.clean$County[order.dendrogram(dend1)])
plot(dend1, horiz = T, main = "Dendrogram of Counties Based on Features(10 Clusters)")
plot(dend1[[2]][[2]][[2]][[2]][[2]][[2]][[2]][[1]][[1]], main = "Santa Barbara County Cluster Based on Features",horiz = T)
pc.score = data.frame(pr.out$x[,1:2])
pc.dist = dist(pc.score)
pc.hclust = hclust(pc.dist)
clus2 = cutree(pc.hclust, 10)
## dendrogram: branches colored by 10 groups
dend2 = as.dendrogram(pc.hclust)
# color branches and labels by 10 clusters
dend2 = color_branches(dend2, k = 10)
dend2 = color_labels(dend2, k = 10)
# change label size
dend2 = set(dend2, "labels_cex", .4)
dend2 = set_labels(dend2, labels=census.clean$County[order.dendrogram(dend2)])
plot(dend2, horiz = T)
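Cluster membership can also be looked up programmatically rather than by reading endpoints off the dendrograms; a sketch using `clus` and `clus2` from the chunks above:

```r
# Find Santa Barbara County's cluster in each cutree() result
sb <- which(census.clean$County == "Santa Barbara County")
clus[sb]      # cluster under the full-feature clustering
clus2[sb]     # cluster under the PC1/PC2 clustering
table(clus)   # cluster sizes, to see how uneven the full-feature split is
```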
After searching through the first dendrogram, I identified Santa Barbara County in the pink cluster by looking through all the endpoints. Since this is the largest cluster, it tells us little about Santa Barbara County beyond its small distance, across the full feature set, to many other counties. In the dendrogram built from the PC1 and PC2 scores, the clusters split more evenly, so, in contrast to the first dendrogram, Santa Barbara lands in a more specific cluster.

### Classification
# we move all state and county names into lower-case
tmpwinner <- county.winner %>% ungroup %>%
mutate_at(vars(state, county), tolower)
# we move all state and county names into lower-case
# we further remove suffixes of "county" and "parish"
tmpcensus <- census.clean %>% mutate_at(vars(State, County), tolower) %>%
mutate(County = gsub(" county| parish", "", County))
# we join the two datasets
election.cl <- tmpwinner %>%
left_join(tmpcensus, by = c("state"="State", "county"="County")) %>%
na.omit
# drop levels of county winners if you haven't done so in previous parts
election.cl$candidate <- droplevels(election.cl$candidate)
## save meta information
election.meta <- election.cl %>% select(c(county, party, CountyId, state, votes, pct, total))
## save predictors and class labels
election.cl = election.cl %>% select(-c(county, party, CountyId, state, votes, pct, total))
We excluded party as a predictor because it is not a numeric feature and, being the winning candidate's party, it would trivially determine the class label.
set.seed(12)
n <- nrow(election.cl)
idx.tr <- sample.int(n, 0.8*n)
election.tr <- election.cl[idx.tr, ]
election.te <- election.cl[-idx.tr, ]
set.seed(20)
nfold <- 10
folds <- sample(cut(1:nrow(election.tr), breaks=nfold, labels=FALSE))
calc_error_rate = function(predicted.value, true.value){
return(mean(true.value!=predicted.value))
}
records = matrix(NA, nrow=5, ncol=2)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","logistic","lasso","rf","boosted")
set.seed(123)
election.tree = tree(candidate ~.,data=election.tr)
draw.tree(election.tree, nodeinfo = TRUE, cex = .5)
cv= cv.tree(election.tree, FUN=prune.misclass, rand = folds)
best_size= min(cv$size[cv$dev==min(cv$dev)])
pt.cv= prune.misclass(election.tree, best= best_size)
draw.tree(pt.cv, nodeinfo=TRUE, cex=0.5)
pred.pt.cv.te= predict(pt.cv, election.te, type="class")
pred.pt.cv.tr = predict(pt.cv, election.tr, type = "class")
records[1,2]=calc_error_rate(pred.pt.cv.te, election.te$candidate)
records[1,1]=calc_error_rate(pred.pt.cv.tr, election.tr$candidate)
records
## train.error test.error
## tree 0.08100446 0.0987055
## logistic NA NA
## lasso NA NA
## rf NA NA
## boosted NA NA
The pruned tree has a slightly lower correct-classification rate, as it uses fewer variables in its decision making. From the pruned tree, we can see that in counties where less than 1.15% of people commute by transit, Donald Trump wins 83.6% of the time. Moving down, among those counties where more than 48.95% of the total population is white, Trump wins 91.7% of the time, while counties with an unemployment rate above 6.75% go to Joe Biden around 55.6% of the time. On the right side of the tree, of the 379 counties with a transit percentage above 1.15, around 60.9% went to Biden when the total population exceeded 131021. Of the remaining 39.1%, with a total population below 131021, counties where less than 18.95% of the population works in service occupations went to Trump 83.3% of the time. Further down, of the 84 counties at the Professional split, those with less than 45.15% working in professional fields also went to Trump 83.3% of the time. At the final TotalPop split, counties whose white population exceeded 80.5% had around a 16.5% chance of going to Trump, with the remainder going to Biden.
election.tr
## # A tibble: 2,469 x 23
## candidate TotalPop Men White VotingAgeCitizen Income Poverty Professional
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Joe Biden 68364 48.3 71.4 73.9 59684 13.3 41.8
## 2 Donald T… 134327 48.9 78.3 74.9 46213 17.4 29
## 3 Donald T… 10000 53.1 79.1 75.9 31933 26.2 23.8
## 4 Donald T… 310186 48.4 77.1 77.6 46475 17 27.6
## 5 Donald T… 34667 49.6 64.5 73.3 55793 13.2 32.7
## 6 Donald T… 10893 49.7 87.6 74.9 36893 17.3 26.9
## 7 Donald T… 68520 48.9 93.3 79.2 38266 17.2 27.7
## 8 Donald T… 37763 51.0 92.9 78.2 44837 15.6 27.1
## 9 Donald T… 17711 50.3 96.2 77.3 51114 10.4 27.6
## 10 Donald T… 8700 51.7 95.8 74.1 49135 15.7 24.1
## # … with 2,459 more rows, and 15 more variables: Service <dbl>, Office <dbl>,
## # Production <dbl>, Drive <dbl>, Carpool <dbl>, Transit <dbl>,
## # OtherTransp <dbl>, WorkAtHome <dbl>, MeanCommute <dbl>, Employed <dbl>,
## # PrivateWork <dbl>, SelfEmployed <dbl>, FamilyWork <dbl>,
## # Unemployment <dbl>, Minority <dbl>
sum(election.tr$Transit<1.15 & election.tr$White <48.95 & election.tr$Unemployment> 6.75 )
## [1] 135
glm_fit= glm(candidate ~ .,
data= election.tr, family= "binomial")
summary(glm_fit)
##
## Call:
## glm(formula = candidate ~ ., family = "binomial", data = election.tr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.10668 -0.24555 -0.10004 -0.03369 3.05940
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.738e+01 9.351e+00 -2.928 0.003416 **
## TotalPop 1.451e-06 6.050e-07 2.398 0.016485 *
## Men 4.255e-02 4.336e-02 0.981 0.326466
## White -1.429e-01 6.711e-02 -2.130 0.033207 *
## VotingAgeCitizen 1.473e-01 2.573e-02 5.723 1.05e-08 ***
## Income -1.634e-05 1.691e-05 -0.966 0.333915
## Poverty 2.824e-02 2.957e-02 0.955 0.339635
## Professional 3.189e-01 4.066e-02 7.844 4.38e-15 ***
## Service 3.623e-01 5.089e-02 7.119 1.09e-12 ***
## Office 1.519e-01 5.035e-02 3.017 0.002550 **
## Production 1.953e-01 4.248e-02 4.598 4.26e-06 ***
## Drive -2.063e-01 4.826e-02 -4.274 1.92e-05 ***
## Carpool -2.113e-01 6.191e-02 -3.413 0.000642 ***
## Transit -6.995e-02 1.026e-01 -0.682 0.495349
## OtherTransp -8.186e-03 9.979e-02 -0.082 0.934622
## WorkAtHome -8.245e-02 7.366e-02 -1.119 0.262980
## MeanCommute 5.083e-02 2.371e-02 2.144 0.032017 *
## Employed 2.502e-01 3.231e-02 7.745 9.59e-15 ***
## PrivateWork 6.757e-02 2.165e-02 3.121 0.001801 **
## SelfEmployed -4.823e-02 4.711e-02 -1.024 0.305939
## FamilyWork -3.323e-01 3.961e-01 -0.839 0.401486
## Unemployment 2.277e-01 4.595e-02 4.955 7.24e-07 ***
## Minority -1.530e-02 6.560e-02 -0.233 0.815526
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2203.82 on 2468 degrees of freedom
## Residual deviance: 816.15 on 2446 degrees of freedom
## AIC: 862.15
##
## Number of Fisher Scoring iterations: 7
exp(coef(glm_fit))
## (Intercept) TotalPop Men White
## 1.288993e-12 1.000001e+00 1.043467e+00 8.668349e-01
## VotingAgeCitizen Income Poverty Professional
## 1.158650e+00 9.999837e-01 1.028639e+00 1.375603e+00
## Service Office Production Drive
## 1.436622e+00 1.164084e+00 1.215721e+00 8.136014e-01
## Carpool Transit OtherTransp WorkAtHome
## 8.095099e-01 9.324388e-01 9.918479e-01 9.208533e-01
## MeanCommute Employed PrivateWork SelfEmployed
## 1.052142e+00 1.284294e+00 1.069906e+00 9.529106e-01
## FamilyWork Unemployment Minority
## 7.172660e-01 1.255700e+00 9.848123e-01
pred.log.te= predict(glm_fit, newdata = election.te, type="response")
pred.log.labeled.te = ifelse(pred.log.te < .5, "Donald Trump","Joe Biden")
pred.log.tr = predict(glm_fit, election.tr, type = "response")
pred.log.labeled.tr = ifelse(pred.log.tr < .5, "Donald Trump","Joe Biden")
records[2,2]=calc_error_rate(pred.log.labeled.te, election.te$candidate)
records[2,1]=calc_error_rate(pred.log.labeled.tr, election.tr$candidate)
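The logistic coefficients reported above are on the log-odds scale; exponentiating converts them to odds ratios, which are easier to interpret (a sketch using the fitted model above, with approximate values taken from the exp(coef) output):

```r
# Convert log-odds coefficients to odds ratios for interpretation
exp(coef(glm_fit)[c("White", "Unemployment")])
# White ~0.867: odds of a Biden win fall ~13% per one-point increase
# Unemployment ~1.256: odds of a Biden win rise ~26% per one-point increase
```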
The significant variables in this model are TotalPop, White, VotingAgeCitizen, Professional, Service, Office, Production, Drive, Carpool, MeanCommute, Employed, PrivateWork, and Unemployment — many more than in our pruned decision tree. To interpret the coefficient for White, I exponentiate it to get a value of about .867, meaning the odds of Biden winning a county decrease by around 13% for each one-point increase in the white percentage. For Unemployment, the exponentiated coefficient of about 1.256 implies roughly a 26% increase in the odds of a Biden win for each one-point increase in the unemployment rate.

## Problem 17
set.seed(5)
lambda.lasso = seq(1, 50) * 1e-4
x.train = as.matrix(election.tr[,-1])
y.train = election.tr$candidate
x.test = as.matrix(election.te[,-1])
lasso.mod <- cv.glmnet(x.train, y.train, alpha=1, lambda = lambda.lasso, family = "binomial")
plot(lasso.mod)
abline(v = log(lasso.mod$lambda.min), col="red", lwd=3, lty=2)
bestlam =lasso.mod$lambda.min
lasso.coef=predict(lasso.mod,type="coefficients",s=bestlam)[2:22,]
bestlam
## [1] 0.0012
lasso.coef
## TotalPop Men White VotingAgeCitizen
## 1.569020e-06 0.000000e+00 -1.162029e-01 1.496512e-01
## Income Poverty Professional Service
## 0.000000e+00 4.079363e-02 2.390442e-01 2.798460e-01
## Office Production Drive Carpool
## 8.144856e-02 1.205952e-01 -1.357798e-01 -1.294359e-01
## Transit OtherTransp WorkAtHome MeanCommute
## 0.000000e+00 3.886652e-02 -2.096341e-03 1.952859e-02
## Employed PrivateWork SelfEmployed FamilyWork
## 2.145291e-01 5.793665e-02 -6.509845e-02 -2.725748e-01
## Unemployment
## 1.990332e-01
lasso.pred.train = predict(lasso.mod, s = bestlam, newx = x.train,type = "response")
lasso.pred.train.labeled = ifelse(lasso.pred.train < .5, "Donald Trump","Joe Biden")
lasso.pred.test = predict(lasso.mod, s = bestlam, newx = x.test,type = "response")
lasso.pred.test.labeled = ifelse(lasso.pred.test < .5, "Donald Trump","Joe Biden")
records[3,2]=calc_error_rate(lasso.pred.test.labeled, election.te$candidate)
records[3,1]=calc_error_rate(lasso.pred.train.labeled,election.tr$candidate)
The best lambda value in this case is .0012. In comparison to the logistic model, the lasso model has a slightly higher training error but a lower test error. The non-zero coefficients at this lambda are TotalPop, White, VotingAgeCitizen, Poverty, Professional, Service, Office, Production, Drive, Carpool, OtherTransp, WorkAtHome, MeanCommute, Employed, PrivateWork, SelfEmployed, FamilyWork, and Unemployment.
election_cand = ifelse(election.te$candidate == "Donald Trump",0,1)
pred.pt.cv.te.prob = predict(pt.cv, election.te)
tree.pred = prediction(pred.pt.cv.te.prob[,2], election_cand)
perf.tree = performance(tree.pred, measure = "tpr", x.measure = "fpr")
log.pred = prediction(pred.log.te, election_cand)
perf.log =performance(log.pred, measure = "tpr", x.measure = "fpr")
lasso.pred = prediction(lasso.pred.test, election_cand)
perf.lasso =performance(lasso.pred, measure = "tpr", x.measure = "fpr")
plot(perf.tree, col = 2, lwd =2,main = "ROC Curves")
abline(0,1)
par(new=TRUE)
plot(perf.log, col = 3, lwd = 2)
par(new = TRUE)
plot(perf.lasso, col = 4, lwd = 2)
legend("bottomright",legend = c("Pruned Decision Tree","Logistic Model","Lasso Model"), col = c(2:4),lty =1)
#RF model
set.seed(5)
rf.election.train = randomForest(candidate ~ ., data = election.tr, importance = TRUE, ntree = 40)
rf.pred.te= predict(rf.election.train, newdata = election.te,type = "prob")
rf.pred.tr = predict(rf.election.train, newdata = election.tr,type = "prob")
election.rf.labeled.te = ifelse(rf.pred.te[,1]>.5, "Donald Trump","Joe Biden")
election.rf.labeled.tr = as.factor(ifelse(rf.pred.tr[,1]>.5, "Donald Trump","Joe Biden"))
records[4,2]=calc_error_rate(election.rf.labeled.te, election.te$candidate)
records[4,1]=calc_error_rate(election.rf.labeled.tr,election.tr$candidate)
rf.pred = prediction(rf.pred.te[,2], election.te$candidate)
perf.rf = performance(rf.pred, measure = "tpr", x.measure = "fpr")
#boosted model
set.seed(4)
election.boost.training = gbm(ifelse(candidate=="Donald Trump", 0, 1)~. ,data = election.tr, n.trees = 1000, shrinkage = .1, distribution = "bernoulli")
summary(election.boost.training)
## var rel.inf
## White White 24.0170037
## Transit Transit 22.5487385
## TotalPop TotalPop 12.5190736
## Professional Professional 7.5282181
## Employed Employed 5.0302233
## VotingAgeCitizen VotingAgeCitizen 3.5396303
## Minority Minority 3.5165478
## Income Income 3.4369720
## Service Service 3.3126322
## Unemployment Unemployment 2.4710346
## Men Men 2.3080618
## Poverty Poverty 2.0548538
## Production Production 2.0252723
## OtherTransp OtherTransp 0.9241050
## SelfEmployed SelfEmployed 0.9122087
## PrivateWork PrivateWork 0.8934608
## Drive Drive 0.8247592
## Office Office 0.6817054
## WorkAtHome WorkAtHome 0.5504330
## Carpool Carpool 0.4870107
## MeanCommute MeanCommute 0.3613464
## FamilyWork FamilyWork 0.0567090
election.boost.tr = predict(election.boost.training, newdata = election.tr, type = "response")
## Using 1000 trees...
election.boost.te = predict(election.boost.training, newdata = election.te, type = "response")
## Using 1000 trees...
election.boost.tr.labeled = ifelse(election.boost.tr < .5, "Donald Trump", "Joe Biden")
election.boost.te.labeled = ifelse(election.boost.te < .5, "Donald Trump", "Joe Biden")
records[5,1]=calc_error_rate(election.boost.tr.labeled,election.tr$candidate)
records[5,2]=calc_error_rate(election.boost.te.labeled,election.te$candidate)
boost.pred = prediction(election.boost.te, election.te$candidate)
perf.boost = performance(boost.pred, measure = "tpr", x.measure = "fpr")
#All ROC Curves, AUC values and Records
plot(perf.tree, col = 2, lwd =2,main = "ROC Curves")
abline(0,1)
par(new=TRUE)
plot(perf.log, col = 3, lwd = 2)
par(new = TRUE)
plot(perf.lasso, col = 4, lwd = 2)
par(new = TRUE)
plot(perf.rf, col = 5, lwd = 2)
par(new = TRUE)
plot(perf.boost, col = 6, lwd = 2)
legend("bottomright",legend = c("Pruned Decision Tree","Logistic Model","Lasso Model","RF Model","Boosted Model"), col = c(2:6),lty =1, cex = .5)
auc.tree = performance(tree.pred, "auc")@y.values
auc.log = performance(log.pred, "auc")@y.values
auc.lasso = performance(lasso.pred, "auc")@y.values
auc.rf = performance(rf.pred, "auc")@y.values
auc.boost = performance(boost.pred, "auc")@y.values
cbind(c("Pruned Decision Tree","Logistic Model","Lasso Model","RF Model","Boosted Model"), c(auc.tree,auc.log,auc.lasso,auc.rf,auc.boost))
## [,1] [,2]
## [1,] "Pruned Decision Tree" 0.8570153
## [2,] "Logistic Model" 0.9556515
## [3,] "Lasso Model" 0.958281
## [4,] "RF Model" 0.950157
## [5,] "Boosted Model" 0.9560243
records
## train.error test.error
## tree 0.0810044552 0.09870550
## logistic 0.0619684083 0.07928803
## lasso 0.0627784528 0.07119741
## rf 0.0004050223 0.06472492
## boosted 0.0287565816 0.06472492
Looking at the AUC values for these curves, we see that the lasso model has the highest AUC. The random forest and boosted models achieve lower test errors than the pruned decision tree, logistic, and lasso models, so in terms of raw accuracy they appear better; however, the near-zero training error of the random forest suggests some overfitting. In terms of the trade-off between true positive and false positive rates that the AUC summarizes, the lasso model does the best.
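The comparison above can also be done programmatically. A minimal sketch, using the AUC values already reported in the table (hard-coded here rather than pulled from the `performance` objects):

```r
# AUC values for each model, copied from the output above
auc.vals <- c(tree = 0.8570153, logistic = 0.9556515,
              lasso = 0.9582810, rf = 0.9501570, boosted = 0.9560243)

# Select the model with the largest AUC
best.model <- names(which.max(auc.vals))
best.model  # "lasso"
```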
#calculate swing counties
swing=county.winner %>% filter(pct <= 0.525 & pct >= 0.475)
swing
## # A tibble: 496 x 7
## # Groups: state, county [493]
## state county candidate party votes total pct
## <chr> <chr> <fct> <fct> <dbl> <dbl> <dbl>
## 1 New Hampshire Acworth Joe Biden DEM 293 569 0.515
## 2 Idaho Ada Donald Trump REP 130699 259389 0.504
## 3 Vermont Addison Joe Biden DEM 445 870 0.511
## 4 Massachusetts Agawam Donald Trump REP 8021 16069 0.499
## 5 Colorado Alamosa Donald Trump REP 3676 7519 0.489
## 6 Wyoming Albany Joe Biden DEM 9091 18612 0.488
## 7 Vermont Alburgh Joe Biden DEM 550 1079 0.510
## 8 Maine Alfred Joe Biden DEM 983 1995 0.493
## 9 Connecticut Andover Joe Biden DEM 1058 2091 0.506
## 10 Minnesota Anoka Donald Trump REP 104894 210474 0.498
## # … with 486 more rows
swing$subregion= tolower(swing$county)
county_join= left_join(swing, county, by= 'subregion')
ggplot(data = county_join) +
geom_polygon(aes(x = long, y = lat, fill = subregion, group = group),
color = "white") +
coord_fixed(1.3) +
guides(fill=FALSE)
hi=table(swing$candidate)
barplot(hi)
This election was very closely divided, which made it difficult to predict the future president. We classified a county as a "swing" county if the difference between the vote shares for Biden and Trump was 5% or less.
Using this definition, there were 496 "swing" counties with no strong inclination towards either candidate: 238 counties leaned towards Donald Trump by 5% or less, while 258 counties leaned slightly towards Joe Biden. Out of the 2825 unique counties, 17.56% were considered "swing" counties in our calculations. Had we instead defined a "swing" county as one where the winner received between 45% and 55% of the vote, swing counties would have made up about 30% of the total.
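The counts and proportions above can be recomputed directly. A minimal sketch, with the per-candidate swing counts and the total county count hard-coded from the output above:

```r
# Swing-county counts reported above (winner's margin <= 5%)
n_trump  <- 238    # swing counties leaning towards Donald Trump
n_biden  <- 258    # swing counties leaning towards Joe Biden
n_swing  <- n_trump + n_biden   # 496 total swing counties

n_county <- 2825   # unique counties in county.winner

# Share of counties that are "swing" counties
round(100 * n_swing / n_county, 2)  # 17.56
```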
Looking at the map, the swing counties are spread fairly evenly across the country. Because of this, who the electoral votes go to is uncertain, since it is not just one state that is undecided about its preferred candidate.
Ultimately, because of how many swing counties there are and how spread out they are, it is difficult to predict which presidential candidate the counties preferred.
On top of this, due to COVID-19, mail-in ballots were accepted until November 3, 2020. Since a mail-in ballot could take up to a week to arrive at a polling station, this adds even more uncertainty to the data.
Based on the multiple models included in our analysis, the lasso model had the highest AUC of all the models we fit. Interpreting this lasso model, the variables with a significant effect on the voting outcome were TotalPop, White, VotingAgeCitizen, Poverty, Professional, Service, Office, Production, Drive, Carpool, OtherTransp, WorkAtHome, MeanCommute, Employed, PrivateWork, SelfEmployed, FamilyWork, and Unemployment. The remaining variables had coefficients shrunk to 0 by the lasso penalty; keeping them could have caused overfitting. To visualize the errors on the map, I will add the lasso model's predicted test results to the sampled test election data set.
records[3,2]
## [1] 0.07119741
We saw the error of this to be .07119 which is a solid test error value with a lower training error value that makes sense. I feel like the significant variables chosen in this model make a lot of sense such as Professional, Poverty and Unemployment as these were variables that we heard of take a large role in our recent election. However, these insights are not enough as there were clearly more variables taken into account that had significance on our test data set for the lasso model. This would imply a lack of undersstanding and a need to expand our domain knowledge as I was not able to completely understand why the data was influenced by these variables completely.